Locating Noun Phrases with Finite State Transducers

Author

  • Jean Senellart
Abstract

We present a method for constructing, maintaining and consulting a database of proper nouns. We describe noun phrases composed of a proper noun and/or a description of a human occupation. They are formalized by finite state transducers (FSTs) and large coverage dictionaries and are applied to a corpus of newspapers. We take into account synonymy and hyperonymy. This first stage of our parsing procedure has a high degree of accuracy. We show how we can handle requests such as: 'Find all newspaper articles in a general corpus mentioning the French prime minister', or 'How is Mr. X referred to in the corpus; what have been his different occupations throughout the period over which our corpus extends?' In the first case, non-trivial occurrences of noun phrases are located, that is, phrases not containing words present in the request, but either synonyms or proper nouns relevant to the request. The results of the search are far better than those obtained by a key-word based engine. Most answers are correct, except for some cases of homonymy (where a human reader would also fail without more context). Also, the treatment of people having several different occupations is not fully resolved. We have built, for French, a library of about one thousand such FSTs, and English FSTs are under construction. The same method can be used to locate and propose new proper nouns, simply by replacing given proper names in the same FSTs by variables.

1 Introduction

Information Retrieval in full texts is one of the challenges of the coming years. Web engines attempt to select, among the millions of existing Web sites, those corresponding to some input request. Newspaper archives are another example: there are several gigabytes of news in electronic form, and the size is increasing every day. Different approaches have been proposed to retrieve precise information in a large database of natural texts:

1. Key-word algorithms (e.g. Yahoo): co-occurrences of the different words of the request are searched for in a single document. Generally, slight variations of spelling are allowed to take into account grammatical endings and typing errors.
2. Exact pattern algorithms (e.g. OED): sequences containing occurrences described by a regular expression on characters are located.
3. Statistical algorithms (e.g. LiveTopic): they offer the user documents containing words of the request and also words that are statistically and semantically close with respect to clustering or factorial analysis.

The first method is the simplest one: it generally yields results with considerable noise (documents containing homographs of the words of the request, not related to the request, or documents containing words that have a form very close to that of the request, but with a different meaning). The second method yields excellent results, to the extent that the pattern of the request is sufficiently complex, and thus allows specification of synonymous forms. Also, the different grammatical endings can be described precisely. The drawback of such precision is the difficulty of building and handling complex requests. The third approach can provide good results for a very simple request. But, like any statistical method, it needs documents of a huge size, and thus cannot take into account words occurring a limited number of times in the database, which is the case of roughly one word out of two, according to Zipf's law¹ (Zipf, 1932).

We are particularly interested in finding noun phrases containing or referring to proper nouns, in order to answer the following requests:

1. Who is John Major?
2. Find all documents referring to John Major.
3. Find all people who have been French ministers of culture.

With the key-word method, texts containing the sequence 'John Major' are found, but also texts containing 'a UN Protection Force, Major Rob Anninck', 'P. Major', 'a former Long Islander, John Jacques' and 'Mr. Major'.
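The noise of the key-word method can be illustrated with a minimal Python sketch. It assumes, for illustration only, that a key-word engine returns any snippet sharing at least one word with the request; real engines differ, and the names `SNIPPETS`, `keyword_hits` and `exact_hits` are hypothetical. The snippets are the phrases quoted above.

```python
import re

# Toy snippets echoing the phrases quoted in the text.
SNIPPETS = [
    "the prime minister, John Major",
    "a UN Protection Force, Major Rob Anninck",
    "P. Major",
    "a former Long Islander, John Jacques",
    "Mr. Major",
]

def tokens(text):
    """Set of lowercased alphabetic tokens of a snippet."""
    return set(re.findall(r"[a-z]+", text.lower()))

def keyword_hits(request, snippets):
    """Assumed key-word behaviour: keep snippets sharing any word with the request."""
    wanted = tokens(request)
    return [s for s in snippets if tokens(s) & wanted]

def exact_hits(request, snippets):
    """Keep only snippets containing the request as a contiguous sequence."""
    return [s for s in snippets if request.lower() in s.lower()]

if __name__ == "__main__":
    print(len(keyword_hits("John Major", SNIPPETS)))  # all five snippets match
    print(exact_hits("John Major", SNIPPETS))         # only the first one matches
```

Under this assumption the key-word search returns all five snippets, including 'Major Rob Anninck' and 'John Jacques' (noise), whereas the exact sequence search keeps only the first snippet and therefore misses 'Mr. Major', which may well refer to the same person (silence).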
The statistical approach will probably succeed (supposing the text is large enough) in associating the words John Major with the words Britain, prime and minister. Therefore, it would provide documents containing the sequence 'the prime minister, John Major', but also 'the French prime minister' or 'Timothy Eggar, Britain's energy minister', which have exactly the same number of correctly associated words. Such answers are an inevitable consequence of any method that is not grammatically founded. M. Gross and J. Senellart (1998) have proposed a preprocessing step which groups up to 50% of the words of the text into compound utterances. By hiding irrelevant meanings of simple words which are part of compounds, they obtain more relevant tokens. In the preceding example, the minimal tokens would be the compound nouns 'prime minister' or 'energy minister'; thus, the statistical engine could not have misinterpreted the word 'minister' in 'energy minister' and in 'prime minister'.

We propose here a new method based on a formal and full description of the specific phrases actually used to describe occupations. We also use large coverage dictionaries and libraries of general purpose finite state transducers. Our algorithm finds answers to questions of types 1, 2 and 3, with nearly no errors due to silence or to noise. The few cases of remaining errors are treated in section 5, and we show that, in order to avoid them by a general method, one must perform a complete syntactic analysis of the sentence.

¹ This is true whatever the size of the database is.

Our algorithm has three different applications. First, by using dictionaries of proper nouns and local grammars describing occupations, it answers requests. Synonyms and hyponyms are formally treated, as well as the chronological evolution of the corpus. By consulting a preprocessed index of the database, it provides results in real time.
The second application of the algorithm consists in replacing proper nouns in FSTs by variables, and using them to locate and propose to the user new proper nouns not listed in dictionaries. In this way, the construction of the library of FSTs and of the dictionaries can be automated, at least in part. The third application is automatic translation of such noun phrases, by constructing the equivalent transducers in the different languages.

In section 2, we provide the formal description of the problem, and we show how we can use automaton representations. In section 3, we show how we can handle requests. In section 4, we give some examples. In section 5, we analyze failed answers. In section 6, we show how we use transducers to enrich a dictionary.

2 Formal Description

We deal with noun phrases containing a description of an occupation, a proper noun, or both combined. For example, 'a senior RAF officer', 'Peter Lilley, the Shadow Chancellor', 'Sir Terence Burns, the Treasury Permanent Secretary' or 'a former Haitian prime minister, Rosny Smarth'. For our purpose, we must have a formal way of describing and identifying such sequences.

2.1 Description of occupations

We describe occupations by means of local grammars, which are directly written in the form of FS graphs. These graphs are equivalent to FSTs with inverted representation (FST) (Roche and Schabes, 1997) as in figure 1, where each box represents a transition of the automaton (input of the transducer), and the label under a box is an output of the transducer. The initial state is the left arrow, the final state is the double square. The optional grey boxes (cf. figure 2) represent sub-transducers: in other words, by 'zooming' on all sub-transducers, we view a given FST as a simple graph, with no parts singled out. However, we insist on
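The idea of a local grammar written as a graph, and of replacing proper nouns by variables, can be sketched in Python as follows. This is an illustrative toy and not the library described in the paper: the graph, the output labels (`modifier`, `nationality`, `occupation`, `name`), the two small dictionaries and the use of capitalization as the "variable" test are all assumptions made for the example.

```python
import re

NATIONALITY = {"haitian", "french", "british"}  # hypothetical toy dictionary
KNOWN_NAMES = {"rosny", "smarth"}               # hypothetical toy dictionary
FINAL = 8                                       # accepting state of the toy graph

def lit(*words):
    """Transition test: the token is one of the given lowercase words."""
    return lambda tok: tok.lower() in words

def build_graph(name_test):
    """(a|the) [former] [<nationality>] prime minister , <name> <name>.
    Each transition is (input test, output label, target); test None is an epsilon move."""
    return {
        0: [(lit("a", "the"), None, 1)],
        1: [(lit("former"), "modifier", 2), (None, None, 2)],
        2: [(lit(*NATIONALITY), "nationality", 3), (None, None, 3)],
        3: [(lit("prime"), "occupation", 4)],
        4: [(lit("minister"), "occupation", 5)],
        5: [(lit(","), None, 6)],
        6: [(name_test, "name", 7)],
        7: [(name_test, "name", 8)],
    }

def run(graph, toks, pos, state):
    """Depth-first search; returns (end position, outputs) for the first accepting path."""
    if state == FINAL:
        return pos, []
    for test, label, target in graph.get(state, []):
        if test is None:
            result, consumed = run(graph, toks, pos, target), None
        elif pos < len(toks) and test(toks[pos]):
            result, consumed = run(graph, toks, pos + 1, target), toks[pos]
        else:
            continue
        if result is not None:
            end, outs = result
            if label is not None and consumed is not None:
                outs = [(label, consumed)] + outs
            return end, outs
    return None

def locate(text, graph):
    """Try the graph at every token position and collect the labelled outputs."""
    toks = re.findall(r"\w+|[^\w\s]", text)
    found = []
    for start in range(len(toks)):
        result = run(graph, toks, start, 0)
        if result is not None:
            fields = {}
            for label, tok in result[1]:
                fields[label] = (fields.get(label, "") + " " + tok).strip()
            found.append(fields)
    return found

known = build_graph(lambda tok: tok.lower() in KNOWN_NAMES)  # names from the dictionary
variable = build_graph(str.istitle)                          # names replaced by a variable
```

With the dictionary-based graph, `locate("a former Haitian prime minister, Rosny Smarth", known)` labels the modifier, nationality, occupation and name, while `locate("the prime minister, John Major", known)` finds nothing because the name is not listed. The variable-based graph accepts the second phrase and proposes 'John Major' as a candidate proper noun. Treating any capitalized token as a name is a crude stand-in for a variable and would over-generate on real text.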





Publication date: 1998